Summary


1. Many animals communicate using sequences of discrete acoustic elements which can be complex, vary in their 
degree of stereotypy, and are potentially open-ended. Variation in sequences can provide important ecological, 
behavioural or evolutionary information about the structure and connectivity of populations, mechanisms for 
vocal cultural evolution and the underlying drivers responsible for these processes. Various mathematical
techniques have been used to form a realistic approximation of sequence similarity for such tasks.

2. Here, we use both simulated and empirical data sets from animal vocal sequences (rock hyrax, Procavia capen- 
sis; humpback whale, Megaptera novaeangliae; bottlenose dolphin, Tursiops truncatus; and Carolina chickadee, 
Poecile carolinensis) to test which of eight sequence analysis metrics are more likely to reconstruct the informa- 
tion encoded in the sequences, and to test the fidelity of estimation of model parameters, when the sequences are 
assumed to conform to particular statistical models. 

3. Results from the simulated data indicated that multiple metrics were equally successful in reconstructing the 
information encoded in the sequences of simulated individuals (Markov chains, n-gram models, repeat distribu- 
tion and edit distance) and data generated by different stochastic processes (entropy rate and n-grams). However, 
the string edit (Levenshtein) distance performed consistently and significantly better than all other tested metrics 
(including entropy, Markov chains, n-grams, mutual information) for all empirical data sets, despite being less 
commonly used in the field of animal acoustic communication. 

4. The Levenshtein distance metric provides a robust analytical approach that should be considered in the com- 
parison of animal acoustic sequences in preference to other commonly employed techniques (such as Markov 
chains, hidden Markov models or Shannon entropy). The recent discovery that non-Markovian vocal sequences 
may be more common in animal communication than previously thought provides a rich area for future research 
that requires non-Markovian-based analysis techniques to investigate animal grammars and potentially the ori- 
gin of human language. 

Many animals communicate using sequences of discrete acoustic
elements, the best known example being bird song, which is
composed of multiple notes combined in a distinctive order.
These sequences are often complex, non-stereotyped and
potentially open-ended; that is, individuals may use an almost
unlimited repertoire of sequences by making subtle or large
variations to the order of notes (reviewed in Catchpole & Slater
2003). The role of such sequences varies among species. In
some cases, sequences appear to advertise male quality
through sequence complexity, for example in marsh warblers,
Acrocephalus palustris (Darolova et al. 2012); zebra finches,
Taeniopygia guttata (Searcy & Andersson 1986; Neubauer
1999; Holveck et al. 2008); and song sparrows, Melospiza
melodia (Pfaff et al. 2007). In other cases, researchers have
proposed that sequences contain detailed communicative
information such as individual identity, for example bottlenose
dolphins, Tursiops truncatus (Sayigh et al. 1999). It is also
possible that in some species, acoustic sequences are essentially
stochastic with little significance to their precise composition. 
Identifying the role of acoustic sequences in a particular 
species often involves comparing sequences within and 
between individuals, as well as within and between popula- 
tions, so that the nature of the variation can be quantified 
and potentially correlated to ecological or behavioural fac- 
tors. The task of comparing acoustic sequences presumes 
an unequivocal and globally relevant measure of sequence 
similarity, or difference. However, in practice, no such met- 
ric exists. It could be postulated that a measure of 
sequence similarity should reflect the proximal processes 
taking place in the brains of intended conspecific signal 
receivers; that is, the best measure of sequence similarity is 
the one used by the animal itself (Kershenbaum et al. 
2014a). Given that such knowledge is essentially hidden in 
practice, various mathematical techniques have been used
to form a realistic approximation of signal similarity 
(Ashby & Perrin 1988; Young & Hamer 1994; Navarro 
2001; Ranjard 2010). It is possible to categorise similarity 
measures into two distinct approaches. First, it is usually 
possible to characterise a sequence by measuring a small 
number of metrics that are inherent to the sequence itself; 
examples of this include length, or entropy (Freeberg & 
Lucas 2012). Sequences can then be compared by calculat- 
ing the sum of square differences between each of these 
metrics. This is equivalent to representing each sequence as 
a ‘feature vector’ in some relatively compact feature space, 
and measuring the distance between two sequences as the 
Euclidean distance between their two feature vectors. While 
this method is straightforward, there is an assumption that 
it is possible to represent every sequence in a compact 
way, that is, that some sufficiently large combination of 
metrics can ‘summarise’ the properties of a sequence in a 
biologically meaningful way. However, it is far from clear 
that there exists a compact, yet exact, mathematical repre- 
sentation of a sequence, short of the trivial task of writing 
down the entire sequence of elements and attempting to 
measure the Euclidean distance between the full representa- 
tions of two sequences, which is unlikely to produce the 
desired results. An alternative approach is to use aggregate 
techniques that measure properties of a large number of 
sequences and summarise the characteristics of a corpus. 
For example, sequence transition tables (TTs) and element 
frequency histograms have been used in previous studies 
(Jin & Kozhevnikov 2011). In these cases, each vector in 
feature space represents a collection of sequences, and the 
Euclidean distance between vectors measures the difference 
between the sequences from two sets of vocalisations, 
rather than between individual sequences. However, it is 
questionable whether any of these techniques, individual or 
aggregate, can represent the nature of the sequences with 
adequate fidelity. Since we do not know what cognitive 
processes an animal uses to interpret such sequences, we 
cannot be sure that any particular summary metric accu- 
rately reflects the interpretation of the sequence by the 
receiving individual. We refer to all of these above metrics 
as ‘unary’, as they are derived from measurements on each 
string sequence in isolation, even if distances are eventually 
calculated on an aggregate of sequences. 

Secondly, it is possible to measure the difference between a 
pair of sequences directly (Levenshtein 1966), thereby bypass- 
ing the construction of a feature space, and generating a series 
of pairwise comparisons between sequences. Analysing the 
sequence of elements in animal vocalisations can be considered 
analogous to analysing the sequence of nucleotides in DNA, 
and some non-aggregate techniques have been borrowed from 
the field of bioinformatics to capture the similarity or differ- 
ence between two sequences. This approach provides a direct 
measure of pairwise differences, in the form of a distance 
matrix, but without a Euclidean feature space. We refer to 
these metrics as ‘binary’, as they can only be calculated as a 
pairwise comparison between exactly two sequences. Binary 
difference measures are attractive, as they do not rely on the
fidelity of a particular unary metric in representing the proper- 
ties of a sequence. Rather, binary metrics are an unequivocal 
measure of the similarity/difference between two sequences, 
although it cannot be assumed that this measure of similarity is 
the same as that used by the animal itself in distinguishing 
between sequences. Such metrics have long been proposed for 
the analysis of birdsong (Bradley & Bradley 1983; Ranjard et al.
2010), but have not been widely adopted. One disadvantage of 
binary metrics is that a number of common machine learning 
algorithms often used for clustering the results of similarity 
analyses (e.g. k-means, neural networks) rely on data presented 
as a Euclidean feature space, although there are exceptions, for 
example Ranjard & Ross (2008). To use such clustering tech- 
niques, it would be necessary to derive a series of feature vec- 
tors from the binary metric distance matrix. This can be done 
using techniques such as multidimensional scaling or principal 
component analysis to convert a distance matrix to feature vec- 
tors. 

Here, we compare the performance of eight different meth- 
ods for analysing animal vocal sequences, using both aggregate 
statistical metrics and a direct pairwise distance measure. We 
use simulated and empirical sequences to test which approach 
is more likely to reconstruct the information encoded in the 
sequences, and to test the fidelity of estimation of model 
parameters when the sequences are assumed to conform to 
particular statistical models. This direct comparison of a num- 
ber of commonly employed analytical algorithms provides a 
comprehensive evaluation of the utility of these approaches to 
real-world data sets and demonstrates the utility of comparing 
at least two different methods when assessing novel algorithms 
to ensure that results are robust under a range of analytical 
approaches. 


Materials and methods 


We performed two sets of tests (viz. artificial and empirical) to evaluate 
the performance of each metric. In the first tests, we generated artificial 
random sequences and used the different similarity metrics to recon- 
struct the parameters used to generate these sequences, and the stochas- 
tic model types. In the second set of tests, we analysed recordings of 
animal vocalisations and used both unary and binary difference metrics 
to determine contextual information known to exist in these sequences. 
We used the signature whistles of the bottlenose dolphin (Sayigh et al. 
2007; Kershenbaum, Sayigh & Janik 2013), to reconstruct individual 
identity, and the songs of the rock hyrax, Procavia capensis (Kershen- 
baum et al. 2012); the humpback whale, Megaptera novaeangliae (Gar- 
land et al. 2012); and the calls of the Carolina chickadee, Poecile 
carolinensis (Freeberg 2012), to reconstruct geographical dialect. In the 
case of the hyrax, humpback whale and chickadee, the calls consisted 
of a sequence of discrete acoustic elements. In contrast, bottlenose dol- 
phin whistles are often produced in isolation (rather than as a sequence 
of whistles); therefore, we analysed the sequence of frequency modula- 
tion components (e.g. up, down, constant) within whistles, taking these 
modulation components as the acoustic elements (for more details see 
Kershenbaum, Sayigh & Janik 2013). In both our analysis of artificial 
sequences, and empirical animal vocal sequences, we evaluate a number 
of similarity metrics, both binary and unary. Humpback whale song 
recordings are held at the University of Queensland, Australia, and by 
Operation Cetaces in Noumea, New Caledonia. Dolphin whistle
recordings are held at Woods Hole Oceanographic Institution (see 
Data accessibility section for contact details). Before providing details 
of the simulation experiments and empirical data analysis, we describe 
each of the metrics used. 


BINARY METRIC 


Levenshtein distance 


The Levenshtein distance (LD; Levenshtein 1966) is a type of 
string edit distance metric, as it provides a quantitative measure- 
ment of the difference between two string sequences regardless of 
string length. Specifically, the LD measures the minimum number 
of point operations (additions, deletions and substitutions) needed 
to convert one string into another (Levenshtein 1966). By com- 
paring the position of elements within a string and calculating 
the number of changes that it takes to change one string into 
the other, this metric relies more on the sequence of elements 
and less on the overall structural pattern. It has been used exten- 
sively in other fields, for example bioinformatics (Likic 2008) and 
text search/retrieve (Reis et al. 2004), and in a small number of 
previous studies of animal sequences (Garland et al. 2012, 2013; 
Kershenbaum et al. 2012; Krull et al. 2012), and is related to the
better known dynamic time warping algorithm (Buck & Tyack 
1993). However, LD itself remains somewhat unknown in the 
field of animal acoustic communication. In practice, string edit 
distances are often paired with string alignment algorithms or 
additional standardisations, particularly when the strings being 
compared are of different lengths (Fig. 1; see Garland et al.
2012 and Kershenbaum et al. 2012 for additional information
on metric calculation). Importantly, the LD forms the basis of the
Needleman-Wunsch string alignment (Needleman & Wunsch
1970; Likic 2008) that is used extensively in bioinformatics 
research to compare sections of DNA. In our implementation of 
the LD algorithm, we assign an equal cost (of 1) to any correc- 
tion operation (addition, deletion, substitution), no cost (0) for a 
matching element and no cost for differences in sequence lengths 
after optimal alignment. 
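
The unit-cost LD described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this study; the function name `levenshtein` is ours, and the code uses the standard dynamic-programming recurrence in which insertions, deletions and substitutions each cost 1 and matches cost 0.

```python
def levenshtein(a, b):
    """Minimum number of unit-cost insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    # Before processing element i of a, prev[j] is the distance between
    # the first i-1 elements of a and the first j elements of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from the first i elements of a to an empty b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("WQSQSQS", "QSQSQS"))
```

For the strings WQSQSQS and QSQSQS, the function returns 1, because deleting the leading W makes the two strings identical.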

Although other binary metrics exist apart from LD, they are in gen- 
eral unsuitable for the task at hand. For example, the Hamming dis- 
tance requires sequences of the same length, and the most frequent k 
characters simply provides a count of the most common symbol/ele- 
ment. These therefore provide less information than the LD metric. 


Fig. 1. Examples of string alignment and edit distance. (a) Two
unaligned strings (WQSQSQS and QSQSQS) with a Levenshtein distance
(LD) of 7. (b) After aligning the strings to minimise the difference,
LD = 1. (c) Two hyrax bouts which are highly different, LD = 11.
(d) Two bouts which are very similar, LD = 1. Reproduced from
Kershenbaum et al. (2012).


UNARY METRICS 


Transition table 


Acoustic sequences have often been modelled as a Markov chain 
(Briefer et al. 2010; Berwick et al. 2011), in which the probability of 
a particular element occurring depends only on the preceding ele- 
ment (or sometimes, more than one preceding element). These condi- 
tional probabilities of each element, given the preceding element(s), 
can be expressed as a transition matrix $T$, in which the element $T_{ij}$
represents the probability of the element $j$ occurring after the element
$i$. For a sequence consisting of $C$ distinct element types, a $C \times C$
transition matrix can be estimated from empirical data. When comparing
two sequences A and B, the similarity between the transition
matrices $T_A$ and $T_B$ is an indication of the similarity between the
sequences (Jin & Kozhevnikov 2011). To calculate a difference metric
$D_{TT} = f(T_A, T_B)$, we can express each matrix as a $C^2$-dimensional
feature vector $V$, where the elements of the vector are equal to the
elements of the transition matrix $T$, that is $V = T(:)$. We then calculate
the Euclidean distance between the two vectors derived from
sequences A and B:


$$D_{TT}(A, B) = \sqrt{\sum_i (V_{A,i} - V_{B,i})^2}.$$


However, such a metric would not be expected to produce a mean- 
ingful measure for sequences composed of non-overlapping element 
types (e.g. ABCABC and DEFDEF). Therefore, we sort vectors $V_A$
and $V_B$ in order of transition probability before comparison. This
allows a comparison of transition probability distributions, indepen- 
dent of element type. 
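
The TT metric can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the function names are ours, the alphabet is taken as the union of the element types in the two sequences, and rows for elements that are never followed by another element are left as zeros.

```python
import math
from collections import Counter


def transition_matrix(seq, alphabet):
    """Row-normalised first-order transition probabilities T[i][j]."""
    counts = Counter(zip(seq, seq[1:]))
    matrix = []
    for i in alphabet:
        row_total = sum(counts[(i, j)] for j in alphabet)
        # Assumption: an element with no observed successor gets a zero row.
        matrix.append([counts[(i, j)] / row_total if row_total else 0.0
                       for j in alphabet])
    return matrix


def tt_distance(seq_a, seq_b):
    """Euclidean distance between the sorted, flattened transition tables."""
    alphabet = sorted(set(seq_a) | set(seq_b))
    vec_a = sorted(p for row in transition_matrix(seq_a, alphabet) for p in row)
    vec_b = sorted(p for row in transition_matrix(seq_b, alphabet) for p in row)
    return math.dist(vec_a, vec_b)


print(tt_distance("ABCABC", "DEFDEF"))
```

Because the flattened vectors are sorted before comparison, ABCABC and DEFDEF have a distance of zero: both consist of three transitions with probability 1 and otherwise zeros, even though they share no element types.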


N-gram distribution 


Researchers have previously proposed that an important property of 
animal sequences is the nature of repeating units within the sequence 
(Cane 1959; Pruscha & Maurus 1979; Kershenbaum et al. 2014a,b). A 
sequence of length $L$ consists of $L - n + 1$ subsequences of length $n$.
Thus, the five-element sequence ABBAC consists of $5 - 2 + 1 = 4$
two-element subsequences: AB, BB, BA and AC. For a sequence consisting
of $C$ distinct element types, there are a total of $C^n$ distinct possible
$n$-element subsequences. The vector of subsequence frequencies
$P(i \in C^n)$ can be considered a feature vector, and the distance between
two strings calculated in a similar way to that shown above:


$$D_{NG}(A, B) = \sqrt{\sum_i (P_{A,i} - P_{B,i})^2}.$$


In the following analyses, we chose the n-gram (NG) distribution for 
n = 3, as this provides a good balance between coverage and diversity. 
For a comparison of different length NGs in analysing birdsong, see 
Jin & Kozhevnikov (2011). 
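
A minimal sketch of the NG metric is given below. It is illustrative only: the function names are ours, and we assume that the subsequence frequencies are relative frequencies (counts divided by the number of n-grams in the sequence). N-grams that occur in neither sequence contribute zero to the sum, so only observed n-grams need to be enumerated.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """All L - n + 1 contiguous subsequences of length n."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def ngram_distance(seq_a, seq_b, n=3):
    """Euclidean distance between relative n-gram frequency vectors."""
    freq_a = Counter(ngrams(seq_a, n))
    freq_b = Counter(ngrams(seq_b, n))
    # Guard against sequences shorter than n, which contain no n-grams.
    total_a = sum(freq_a.values()) or 1
    total_b = sum(freq_b.values()) or 1
    observed = set(freq_a) | set(freq_b)
    return math.sqrt(sum((freq_a[g] / total_a - freq_b[g] / total_b) ** 2
                         for g in observed))


print(ngrams("ABBAC", 2))
```

The call above lists the four two-element subsequences of ABBAC given in the text: AB, BB, BA and AC.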


Shannon entropy 


Information theory approaches to analysing animal vocal communica- 
tion have become popular in recent years. One metric that is simple to 
understand and easy to apply is the Shannon entropy (SE) (Shannon 
et al. 1949), and this has been used in a number of studies to measure 
the complexity of animal vocal sequences (McCowan, Hanser & Doyle 
1999; Da Silva, Piqueira & Vielliard 2000; Suzuki, Buck & Tyack 2006; 
Doyle et al. 2008). SE measures the unpredictability of a sequence, or 
the lack of uniformity of a sequence, so that a completely predictable 
sequence (e.g. consisting of the same element repeated over and over) 
would have an entropy of zero, whereas a completely unpredictable
(random) sequence would have an entropy of one. The equation for SE
$H$ is as follows:


$$H = -\sum_{i \in 1 \ldots C} P_i \log_e P_i,$$

where $P_i$ is the probability of element $i$, drawn from a set of the $C$
elements occurring in the union of all sequences.
Our SE metric compares two sequences by taking the ratio of the
Shannon entropies of the sequences A and B:


$$D_{SE}(A, B) = H_A / H_B \quad \text{where } H_A < H_B.$$


Although SE is calculated as a single comparison between single 
measurements on two sequences (in contrast to the TT and NG 
metrics described above, both of which result in multiple measure- 
ments on a single sequence), SE should still be considered a unary 
metric, because it does not directly measure the distance between 
two sequences, but rather the difference in a derived metric from 
each. 
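
The SE ratio can be sketched as follows. This is an illustration, not the code used in this study: the function names are ours, element probabilities are estimated from the frequencies within each sequence, and returning 1 when both entropies are zero is our assumption for an otherwise undefined ratio. The ratio itself does not depend on the base of the logarithm.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy of the element frequencies in seq (natural logarithm)."""
    total = len(seq)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(seq).values())


def entropy_ratio(value_a, value_b):
    """Smaller value divided by larger value, so the result lies in [0, 1]."""
    low, high = sorted((value_a, value_b))
    if high == 0:
        return 1.0  # assumption: two zero-entropy sequences are treated as identical
    return low / high


print(entropy_ratio(shannon_entropy("AABB"), shannon_entropy("ABCD")))
```

Here AABB has entropy log 2 and ABCD has entropy log 4, so the ratio is 0.5.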


Entropy rate 


Entropy rate (ER) has been shown to be a useful metric for measuring 
vocal sequence complexity (Kershenbaum 2013). ER is derived from 
the TT of a sequence and can be thought of as a measure of TT diver- 
sity, that is the extent to which different transitions between notes are 
of uniform or non-uniform probability. Given a TT $T_{ij}$ as described
above, ER is defined as:


$$ER = -\sum_{i \in 1 \ldots C} \pi_i \sum_{j \in 1 \ldots C} T_{ij} \log T_{ij},$$

where $\pi_i$ is the stationary probability of element $i$, that is the overall
probability of $i$ occurring in the sequence; see Kershenbaum (2013) for
additional information on metric calculation. As with SE, we define a
metric $D_{ER}$ for the difference between sequences A and B:


$$D_{ER}(A, B) = ER_A / ER_B \quad \text{where } ER_A < ER_B.$$
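
A minimal sketch of the ER calculation for a single sequence follows. It is illustrative only: the function name is ours, the stationary probability of each element is estimated as its overall frequency in the sequence (as described in the text), and transition probabilities are estimated from observed adjacent pairs. Transitions that never occur contribute nothing to the sum and are skipped.

```python
import math
from collections import Counter


def entropy_rate(seq):
    """Entropy rate: -sum_i pi_i sum_j T_ij log T_ij (natural logarithm)."""
    n = len(seq)
    freq = Counter(seq)                       # used for the stationary probabilities
    pair_counts = Counter(zip(seq, seq[1:]))  # observed transitions i -> j
    out_counts = Counter(seq[:-1])            # times each element has a successor
    rate = 0.0
    for (i, _j), count in pair_counts.items():
        t_ij = count / out_counts[i]
        rate -= (freq[i] / n) * t_ij * math.log(t_ij)
    return rate


print(entropy_rate("ABABAB"), entropy_rate("AABAAB"))
```

The strictly alternating sequence ABABAB has an entropy rate of zero, because every transition is certain. In AABAAB, A is followed by A or B with equal probability, giving a rate of (2/3) log 2.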


Repeat distribution 


The repeat number distribution was used in a recent study to compare 
the similarity between natural and synthetic songs of Bengalese finches, 
Lonchura striata var. domestica (Jin & Kozhevnikov 2011). It is an 
aggregate measure, calculated on a corpus of sequences. For each set of 
sequences, a histogram is generated showing the probabilities $P_n$ that
any element occurred in isolation ($n = 1$), was repeated twice ($n = 2$),
three times ($n = 3$) and so on. As with the NG distribution, we define a
metric that measures the difference between two such histograms, generated
from sequences A and B, where $P_A$ and $P_B$ are the feature vectors
of sequences A and B, comprising the repeat distributions (RDs)
for all the elements:


$$D_{RD}(A, B) = \sqrt{\sum_n (P_{A,n} - P_{B,n})^2}.$$
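
The RD metric can be sketched as follows. This is a simplified illustration, not the code used in this study: the function names and the `max_repeat` cap are ours, and run lengths are pooled across all element types rather than kept as a separate histogram for each element.

```python
import math
from collections import Counter
from itertools import groupby


def repeat_distribution(sequences, max_repeat=10):
    """Probability that a run of identical elements has length 1..max_repeat."""
    runs = Counter()
    for seq in sequences:
        for _element, group in groupby(seq):
            # Assumption: runs longer than max_repeat are counted in the last bin.
            length = min(sum(1 for _ in group), max_repeat)
            runs[length] += 1
    total = sum(runs.values())
    return [runs[n] / total if total else 0.0
            for n in range(1, max_repeat + 1)]


def rd_distance(corpus_a, corpus_b, max_repeat=10):
    """Euclidean distance between the repeat distributions of two corpora."""
    return math.dist(repeat_distribution(corpus_a, max_repeat),
                     repeat_distribution(corpus_b, max_repeat))


print(repeat_distribution(["AABBBAC"], max_repeat=3))
```

The sequence AABBBAC contains runs of length 2, 3, 1 and 1, so the histogram is 0.5 for n = 1 and 0.25 for each of n = 2 and n = 3.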


Mutual information 


Mutual information (MI) is an information theory measure that can be
applied easily to quantify the similarity of two sequences. MI combines
both measures of the inherent complexity in a sequence (via SE), and
the joint entropy of the sequences, which measures the probability that
a particular pair of elements will occur at the same point in two
sequences; see Kershenbaum et al. (2012) for additional information
on metric calculation. MI is defined as follows:


$$MI = H(A) + H(B) - \sum_i \sum_j p_{ij} \log p_{ij},$$


where $H(A)$ is the SE of sequence A, $H(B)$ is the SE of sequence B,
and $p_{ij}$ is the probability that elements $i$ and $j$ occur at the same point in
sequences A and B. As with SE, we define a metric $D_{MI}$ for the difference
between sequences A and B:


$$D_{MI} = MI_A / MI_B \quad \text{where } MI_A < MI_B.$$


Lempel-Ziv 


The Lempel-Ziv (LZ) complexity (Lempel & Ziv 1976) is an important
algorithm used for data compression, as it is a measure of the number 
of distinct patterns in a sequence. As a metric of sequence complexity 
and an approximation to Kolmogorov complexity (Evans & Barnett 
2002), it is potentially a useful indicator of the diversity of an animal 
vocal sequence. Although it has not been widely used in animal studies, 
Suzuki, Buck & Tyack (2006) suggested the use of the LZ metric for the 
analysis of humpback whale song, and Kershenbaum (2013) showed 
that the LZ metric outperformed SE in quantifying realistic length 
acoustic sequences. LZ complexity was calculated using the Applied 
Nonlinear Time Series Analysis library for Matlab (Small 2005). 


$$LZ = \frac{c \log L}{L \log K},$$

where $c$ is the number of distinct substrings in a sequence of length $L$,
and $K$ is the maximum number of possible distinct substrings.
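
For readers without access to the Matlab library, the phrase counting that underlies LZ complexity can be sketched in Python. This is our own illustrative substitute, not the library routine used in this study: it implements the parsing of Lempel & Ziv (1976), in which each new phrase is the shortest substring that cannot be copied from the preceding text. The function names are ours, element labels are assumed to be single characters, and the value of `k` in the normalisation c log L / (L log K) is supplied by the caller.

```python
import math


def lz76_phrase_count(seq):
    """Number of phrases in the Lempel-Ziv (1976) parsing of seq."""
    text = "".join(seq)  # assumption: each element is a single character
    position, phrases, n = 0, 0, len(text)
    while position < n:
        length = 1
        # Extend the candidate while it can still be copied from the text
        # that ends one symbol before the candidate's own last symbol.
        while (position + length <= n and
               text[position:position + length] in text[:position + length - 1]):
            length += 1
        phrases += 1
        position += length
    return phrases


def lz_normalised(seq, k):
    """c log L / (L log k), with k chosen by the caller."""
    length = len(seq)
    return lz76_phrase_count(seq) * math.log(length) / (length * math.log(k))


print(lz76_phrase_count("0001101001000101"))
```

The binary string above parses into the six phrases 0, 001, 10, 100, 1000 and 101, whereas a constant sequence such as AAAAAAAA parses into only two.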


SEQUENCES FOR ANALYSIS


Artificial sequences 


In the first test, we evaluated the utility of each of the similarity metrics 
by their ability to identify correctly the stochastic process model from 
which artificial sequences were generated. We generated artificial 
sequences using three different stochastic processes, often used to 
model animal vocal sequences (Kershenbaum et al. 2014b): the
zero-order Markov process (ZOMP), the first-order Markov process
(FOMP) and the semi-Markov renewal process (RP). The ZOMP is an 
independent stochastic process, in which the probability of any particu- 
lar element occurring at a particular point in a sequence is determined 
solely by the prior probability of that element. In the FOMP, element 
probabilities are determined by a TT, where the probability of a partic- 
ular element depends on the immediately preceding element. The RP 
has been shown to be a more realistic model of animal vocal sequence 
production (Kershenbaum et al. 2014b) in which the number of
repeated elements is drawn from a Poisson distribution, rather than 
being determined by the diagonal of a TT. In each case, we examined 
10 sequences of 10 elements each, drawn from five possible elements 
(A-E). We generated 30 sequences, 10 from each of the stochastic pro- 
cesses, ZOMP, FOMP and RP. The ZOMP was modelled by selecting 
five random prior probabilities, one for each element type, and renor- 
malising to sum to unity. We then generated the sequences by selecting 
elements according to these prior probabilities. The FOMP was
modelled by generating a random 5 × 5 TT in a similar way to the ZOMP
prior probabilities, so that the rows of the transition matrix summed to
unity. A random initial element was chosen for each 10-element
sequence, and the remaining nine elements in each sequence were
chosen randomly according to the probabilities in the TT. The RP was
modelled in a similar way to the FOMP, except that for each element 
generated, a random number of repeats were drawn from a Poisson dis- 
tribution with mean five (to give 95% confidence of <9 repeats). Having 
generated 30 sequences of 10 elements, we then calculated a 30 x 30 
distance matrix for each of the similarity metrics. We then used an 
adaptive resonance theory (ART) artificial neural network to cluster 
these 30 points into natural groupings, setting a maximum of 100 possi- 
ble clusters. ART networks have been used in a number of previous 
studies to cluster data derived from animal vocalisations (Janik 1999; 
Deecke & Janik 2006; Quick & Janik 2012). We then calculated the 
normalised mutual information (NMI) as a metric of goodness of 
clustering (Zhong & Ghosh 2005), by comparing the composition 
of the generated clusters H(Y) with the true generation process of 
each H(Y). Thus, NMI indicates the proportion of uncertainty 
predicted by the metric. We then repeated this process 100 times 
using new random transition matrices, generating 3000 sequences 
in total. 
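
The three generation processes can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: all names are ours, the Poisson sampler uses Knuth's multiplication method because the Python standard library has no Poisson generator, each renewal-process run is forced to contain at least one element, and sequences are truncated to 10 elements.

```python
import math
import random

ELEMENTS = "ABCDE"


def random_probabilities(rng, size):
    """Random weights renormalised to sum to unity."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def zomp(rng, priors, length=10):
    """Zero-order process: elements drawn independently from the priors."""
    return "".join(rng.choices(ELEMENTS, weights=priors, k=length))


def fomp(rng, table, length=10):
    """First-order process: each element depends on its predecessor."""
    seq = [rng.choice(ELEMENTS)]
    while len(seq) < length:
        row = table[ELEMENTS.index(seq[-1])]
        seq.append(rng.choices(ELEMENTS, weights=row)[0])
    return "".join(seq)


def poisson(rng, mean):
    """Poisson variate by Knuth's multiplication method."""
    limit, k, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def renewal(rng, table, length=10, mean_repeats=5):
    """Renewal process: run lengths are Poisson, successors follow the TT."""
    seq, current = [], rng.choice(ELEMENTS)
    while len(seq) < length:
        repeats = max(1, poisson(rng, mean_repeats))  # assumption: at least one
        seq.extend(current * repeats)
        row = table[ELEMENTS.index(current)]
        current = rng.choices(ELEMENTS, weights=row)[0]
    return "".join(seq[:length])  # assumption: truncate to the target length


rng = random.Random(0)
priors = random_probabilities(rng, 5)
table = [random_probabilities(rng, 5) for _ in range(5)]
sequences = ([zomp(rng, priors) for _ in range(10)] +
             [fomp(rng, table) for _ in range(10)] +
             [renewal(rng, table) for _ in range(10)])
print(len(sequences))
```

Each repetition of the procedure would draw new priors and a new TT before generating its 30 sequences.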

In the second test using artificial sequences, we simulated ‘individu- 
als’ by generating 100 random RP transition matrices, and from each 
of them producing a set of 10 sequences of 10 elements each. We used 
the RP generation process, rather than a Markovian ZOMP or FOMP, 
as the RP more reliably describes many types of animal vocal sequences 
(Kershenbaum et al. 2014b). Each sequence generated from a single 
transition matrix would be expected to be more similar to other 
sequences from the same transition matrix, than sequences generated 
by a different random transition matrix; therefore, we used a similar 
clustering approach as in the stochastic process analysis above. We cal- 
culated the 100 x 100 distance matrix for each similarity metric, 
obtained by comparing the sequences from each of the 100 transition 
matrices, and clustered the results as before, measuring the NMI as an 
indication of clustering success. 
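
The NMI score used to assess clustering can be sketched as follows. This is an illustration only: the function names are ours, and we assume the normalisation by the geometric mean of the two label entropies, which is one common form of NMI; other normalisations exist and may differ from the one used in this study.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy of a list of labels (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalised_mutual_information(true_labels, cluster_labels):
    """Mutual information divided by sqrt(H(true) * H(cluster))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    mutual = sum((c / n) * math.log(c * n / (true_counts[x] * cluster_counts[y]))
                 for (x, y), c in joint.items())
    denominator = math.sqrt(label_entropy(true_labels) *
                            label_entropy(cluster_labels))
    # Assumption: if either labelling has a single group, report zero.
    return mutual / denominator if denominator > 0 else 0.0


print(normalised_mutual_information([0, 0, 1, 1], [5, 5, 7, 7]))
```

A clustering that matches the true grouping up to relabelling scores 1, while a clustering that is independent of the true grouping scores 0.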

For a final test using artificial sequences, we examined the effect of 
typical sample sizes (number of sequences) on each of the similarity 
metrics. Using the sequences generated in the individual simulation 
above, we varied the number of sequences analysed from one to ten, 
recalculated the distance matrices and clustering and measured the 
NMI. 


Animal sequences 


We tested the performance of the above metrics using empirical 
sequences of animal vocalisations, where those sequences are 
thought to contain information that is known a priori. Very few 
examples exist where contextual information is objectively known to 
exist in animal vocal sequences. However, the signature whistles of 
bottlenose dolphins have been shown to encode individual identity 
in the sequence of up-down frequency shifts, known as a Parsons 
code (Kershenbaum, Sayigh & Janik 2013). We used a data set con- 
sisting of 400 signature whistles, 20 from each of 20 individual dol- 
phins, recorded during capture-release events; see Sayigh et al. 
(2007) and Kershenbaum, Sayigh & Janik (2013) for additional 
details. We converted each whistle into a 9-element Parsons code, 
with seven possible element values (‘large drop’, ‘medium drop’, 
‘small drop’, ‘no change’, ‘small rise’, ‘medium rise’ and ‘large rise’). 
We then calculated distance matrices using each of the similarity 
metrics described above and clustered using an ART network. For 
the calculation of NMI, we compared the generated clusters to the 
known clusters of individual identity. As empirical data do not 
allow the generation of unlimited data sets as with artificial 
sequences, we estimated confidence intervals for each of the
empirical data sets by randomly selecting 80% of the calls for
clustering and calculation of NMI and repeated this process 100 times.
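
The resampling step can be sketched as follows. This is an illustration only: the function names are ours, `score_fn` is a hypothetical stand-in for the clustering and NMI calculation applied to one subsample, and the standard error formula shown is one simple summary of the spread across repeats, which may differ from the calculation used in this study.

```python
import random
import statistics


def subsampled_scores(items, score_fn, fraction=0.8, repeats=100, seed=0):
    """Score `repeats` random subsamples, each containing `fraction` of items."""
    rng = random.Random(seed)
    size = round(fraction * len(items))
    return [score_fn(rng.sample(items, size)) for _ in range(repeats)]


def mean_and_standard_error(scores):
    """Mean and standard error of the mean across repeats."""
    return statistics.mean(scores), statistics.stdev(scores) / len(scores) ** 0.5


# Hypothetical usage: 400 items, with the subsample size standing in for a score.
scores = subsampled_scores(list(range(400)), len)
print(mean_and_standard_error(scores))
```

With 400 calls, each subsample contains 320 calls, which is why the placeholder score in the usage example is constant across repeats.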

We analysed three further empirical data sets for which contextual 
information in vocal sequences has been proposed. The first data set 
used recordings of humpback whales (for details see Garland et al. 
2012), the second data set used recordings of rock hyraxes (Kershen- 
baum et al. 2012) and the third set Carolina chickadees (Freeberg 
2012). Previous studies have shown that in the humpback whale, rock 
hyrax and Carolina chickadee, song syntax varies according to the
geographical origin of the population. For example, not only does
chickadee song syntax vary between locations, but there appear to be
different functional uses of certain sequences in the different populations
(Freeberg 2012). 

The humpback whale data set consisted of 202 songs composed of 
20 different element types (themes), recorded from 42 individuals. 
Humpback whale song is a complex, stereotyped, repetitive, long, male 
display that has multiple levels of hierarchy in its organisation (Payne 
& McVay 1971; Herman & Tavolga 1980; Payne & Payne 1985). A few 
sounds (units) are arranged in a stereotyped phrase which is repeated 
multiple times to make a theme (Payne & McVay 1971). A number of 
themes, sung in a particular order, are combined to form a song. The 
order and content of the themes are highly stereotyped, and all males 
within a population adhere to the same arrangement and content of the 
song at any given time as the display is constantly changing (Frumhoff 
1983; Payne, Tyack & Payne 1983; Payne & Payne 1985). This analysis 
focused on the theme level in the hierarchical arrangement of hump- 
back whale song. Each string therefore represented the sequence of 
themes (elements) that comprised a song, for example theme 1, theme 
2, theme 3, theme 4 and theme 5; see Garland et al. (2012) for further 
information and example sequences. This level within the hierarchy 
takes into account information on the sequence of units and the repeti- 
tion of phrases at a higher level, but does not examine these lower levels 
explicitly. Strings were classified according to their geographical loca- 
tion: New Caledonia, Vanuatu, or eastern Australia, and this geo- 
graphical origin was compared to the clusters generated by the ART 
network. Humpback whale song is constantly changing and has been 
shown to undergo complete song revolutions in this region (Noad et al. 
2000; Garland et al. 2011). The current analysis incorporates two dif- 
ferent song types (lineages) that contain different themes (vocabulary), 
and are present in these populations at various points over the 4 years 
of recording. Therefore, each metric must be robust to the underlying 
transmission dynamics of this display. 

The hyrax data consisted of 1130 song sequences composed of five 
different element types, recorded from a single individual at each of 18 
different locations in Israel. The Carolina chickadee data consisted of 
1184 sequences of calls, recorded from 60 sites in the states of Tennessee 
and Indiana, USA. Links to these data sets are available in the Data 
accessibility section. 


Results 


ARTIFICIAL SEQUENCES 


For sequences generated by different stochastic processes, the
ER metric provided the best clustering, with an NMI value of
0.518 ± 0.005 (standard error; Fig. 2a), while the binary LD
metric gave an NMI of 0.476 ± 0.006. A post hoc Tukey test
following ANOVA showed significant differences between the NMI
scores of these two metrics. All other metrics produced
significantly lower NMI values.

Fig. 2. Results of the normalised mutual information (NMI) scores for each metric using (a) synthetic processes and (b) synthetic individuals. Metric
labels: Levenshtein distance (LD), repeat distribution (RD), transition table (TT), Shannon entropy (SE), Lempel-Ziv (LZ), N-gram (NG), mutual
information (MI) and entropy rate (ER). A-F indicate post hoc Tukey groupings.


Results from clustering sequences of simulated ‘individuals’
(sequences generated by stochastic processes with similar
parameters) indicated that NG produced the highest NMI
score (0.751 ± 0.001), while the LD, RD and TT metrics all
produced high but slightly lower NMI scores (>0.7; Fig. 2b), with
no significant differences among the NMI values of these three
metrics.
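
To illustrate the kind of comparison an n-gram approach makes, the following sketch builds the relative frequencies of bigrams in two sequences and measures how far apart those distributions are. It is a hedged example only: the function names `ngram_distribution` and `l1_distance`, the choice of n = 2 and the use of an L1 distance are assumptions for illustration and are not necessarily the NG metric as implemented in this study.

```python
from collections import Counter


def ngram_distribution(seq, n=2):
    """Relative frequency of each n-gram (as a tuple) in a sequence."""
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}


def l1_distance(p, q):
    """Sum of absolute differences between two n-gram distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical element sequences, one character per acoustic element.
seq_a = "ABABABAB"
seq_b = "ABABBABA"
seq_c = "CDCDCDCD"

d_ab = l1_distance(ngram_distribution(seq_a), ngram_distribution(seq_b))
d_ac = l1_distance(ngram_distribution(seq_a), ngram_distribution(seq_c))
```

Because the distribution aggregates over the whole sequence, this kind of measure summarises which transitions occur and how often, but not where in the sequence they occur.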

Both the LD and NG metrics that performed well on the 
above clustering tasks were also robust to sample size (Fig. 3). 
Most other metrics were also relatively unaffected by sample 
size. However, the RD performed poorly at smaller sample 
sizes (<4), and the MI declined with increasing corpus size (>2). 


ANIMAL SEQUENCES 


When clustering to reconstruct the individual identity from
bottlenose dolphin signature whistles, the LD performed
significantly better than all other tested metrics, with an NMI of
0.661 ± 0.001 (Fig. 4a). The NG distribution also performed
well, with an NMI of 0.63 ± 0.001. Clustering of the humpback
whale song data to indicate population (geographical)
origin showed the LD again performed significantly better
than all other tested metrics (NMI of 0.491 ± 0.005; Fig. 4b).
The NG provided the second best, although significantly
poorer, metric (NMI of 0.367 ± 0.005). All metrics performed
poorly in clustering the geographical origin of hyrax songs;
however, the LD metric was again significantly better than all
others tested (NMI 0.1684 ± 0.001, compared to the next best
NMI of 0.130 ± 0.001 for TT; Fig. 4c). Clustering of the
chickadee data to distinguish between birds recorded in
Tennessee and those recorded in Indiana showed the LD
performed significantly better than all other metrics (NMI of
0.450 ± 0.001; Fig. 4d), followed by NG (NMI
0.369 ± 0.001).


Fig. 3. Effect of sample (corpus) size on the normalised
mutual information (NMI) scores (standard error) for each similarity
metric. Metric labels are the same as in Fig. 2.


Discussion 


We analysed the performance of eight different techniques 
from two broad approaches, to investigate the utility of each 
approach in the comparison of animal sequences. The unary 
and binary metrics performed similarly well in the artificial 
sequence tests, with the ER metric slightly better than the LD 
binary metric, in distinguishing between data generated by dif- 
ferent stochastic processes, and NG slightly better in distin- 
guishing simulated individuals. However, the LD metric 
performed significantly better than all other tested metrics 
when presented with empirical animal sequences. This result 
emphasises that caution should be used when using artificially 
generated sequences based on simple stochastic models to sim- 
ulate animal vocal sequences. Recent work has shown that 
assumptions of simple models for animal vocal production are 
likely to be inaccurate (Kershenbaum et al. 2014b), and similar 
conclusions have been indicated for cetacean song (Miksis- 
Olds et al. 2008). The difference between metric performance 
on artificial and on empirical data is striking. Little is known of 
the cognitive mechanisms by which animals encode and decode 
information in vocalisations (Thornton, Clayton & Grodzinski 
2012); researchers must rely on isolated examples where infor- 
mation content is known a priori to draw conclusions about 
which analytical techniques are best suited for vocal sequence 
data. Our results clearly show that the LD metric outperforms 
other metrics on empirical data, despite performing less effectively
on simulated data. This indicates that the sequential
order of the sequences varied across location/individual, while 
the level of complexity is similar. The LD was the metric of 
choice for clustering dolphin signature whistles into individu- 
als, humpback whale song into populations, hyrax songs into 
geographical region and chickadee calls into state of origin. 
Analysis of the sensitivity of the different metrics to sample size 
showed that most of the metrics that performed well across the 
data sets (LD, NG and LZ) were also robust to sample size. 

Results from the current paper in combination with previ- 
ous work (Helweg et al. 1998; Eriksen et al. 2005; Tougaard & 
Eriksen 2006; Garland et al. 2012, 2013) highlight the success 
of the LD metric in the analysis of sequence content and com- 
parison of humpback whale song. A large body of work has 
previously shown that song differences among humpback 
whale populations can indicate geographical origin of a singer 
(Payne & Guinee 1983; Helweg et al. 1998; Garland et al. 
2015). Despite dynamic song transmission in the South Pacific 
region, fine-scale song differences allow the identification of 
population origin (Garland et al. 2011, 2012, 2013, 2015). The 
current paper examined the theme sequences (i.e. a set of 
phrases under a single label) as part of the largest analysis to 
date of sequence comparison algorithms for humpback whale 
song (Garland et al. 2013), which indicated that the LD outperformed
all other tested metrics. We suggest that, when comparing
song sequences, the LD metric should be employed preferentially,
whereas if the complexity or information content of each
song is the focus of study, the researcher should employ other
techniques such as entropy.
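
The distinction between sequence content and sequence complexity can be made concrete with a small example. The sketch below computes a zeroth-order Shannon entropy from element frequencies alone; the function name `shannon_entropy`, the use of base-2 logarithms and the two example songs are hypothetical, and the entropy metrics evaluated in this study may be defined differently.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Zeroth-order Shannon entropy (bits per element) of a sequence."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


# Hypothetical songs: same elements in the same proportions, different order.
song_a = "AABBCCDD"
song_b = "DCBADCBA"

h_a = shannon_entropy(song_a)  # 2.0 bits
h_b = shannon_entropy(song_b)  # 2.0 bits
```

The two songs receive the same entropy although their element order differs, which is why a measure of this kind describes the complexity of a single sequence rather than the difference between two sequences.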

Previous studies of sequence comparison in hyrax song
(Kershenbaum et al. 2012) have shown geographical variation
in sequence structure using the LD metric, and these findings
were supported by application of an unrelated (unary) metric,
MI. In the current study, MI performed very poorly on both
simulated and empirical data, although MI performance was
somewhat better on the hyrax data than on the other data sets.
This implies that the aspect of the sequences that is measured
by MI does not vary in correlation with geographical location
or individual. While not all studies can compare large numbers
of analytical algorithms, this emphasises the utility of comparing
at least two different techniques when assessing novel algorithms,
to ensure that results are robust under a range of
analytical approaches.

Despite all tested metrics performing poorly in the assess- 
ment of geographical origin in hyrax song, the LD metric was 
significantly better than all others. In previous work Kershen- 
baum et al. (2012) measured the correlation between sequence 
similarity and the distance between populations, rather than 
classification success, and the latter suggests that distinct dia- 
lects are not present in the hyrax. Rather, small but significant 
differences are present between all pairs of populations, 
depending on geographical isolation. In contrast, humpback 
whales, chickadees and bottlenose dolphins show strong dis- 
crimination between in-group and out-group sequences, indi- 
cating that the differences between the vocal sequences of 
different individuals or populations are much more marked. 
This may indicate an adaptive role to distinctive vocalisations 
in dolphins and whales, such as individual identification (Janik 
& Slater 1998; Janik, Sayigh & Wells 2006; Quick & Janik 
2012), while in chickadees adaptive, developmental and phylo- 
genetic explanations for regional dialects have been suggested 
(Freeberg 2012). Humpback whale song is hypothesised to 
contain information about the reproductive fitness and popu- 
lation origin of the signaller (Payne & Guinee 1983; Helweg 
et al. 1992). Hyrax song complexity is not thought to contain 
contextual information beyond male fitness (Koren & Geffen 
2009; Demartsev et al. 2014), although this assumption is cur- 
rently untested. In contrast, dolphin signature whistles are 
known to be individually distinctive whistles that can be identi- 
fied by the unique pattern of frequency modulations (Janik, 
Sayigh & Wells 2006). The characterisation of signature whistles
based on a 7-element Parsons code in a previous study
(Kershenbaum, Sayigh & Janik 2013) allows individual identification
of the whistler. The LD significantly outperformed all
other models in clustering to reconstruct not only the individual
identity from signature whistles, but also the geographical origin
for humpback whale song, chickadee calls and hyrax song,
highlighting the importance of evaluating different metrics
with a priori information.

One likely explanation for the higher performance of the 
LD metric is that it alone among the metrics analysed uses a 
direct comparison of the vocal sequences between samples, 
thereby using more information about the sequences than the 
other metrics. The LD metric by design can solely be employed 
to compare two strings and it excels at this task; it does not pro- 
vide an understanding of the information content within each 
string, or the sequence structure. By necessity, this means that 
LD also compares the vocabularies of a pair of sequences, and 
therefore, two sequences that are based on the same set of 
sequence elements are likely to have a lower LD value than 
two sequences that are composed of different elements, but 
have similar sequence structure. Regional differences in the
vocabulary (e.g. humpback song themes) provide important
information on the connectivity of populations at a broad scale
despite an overall similarity in song structure (hierarchical
arrangement). Establishing the influence of overlapping
vocabulary is beyond the scope of this paper (although two of
the three humpback populations switched between two vocabularies,
i.e. song types, over the course of this study), but we present
as Supporting information (Fig. S1) the element
distributions of the different data sets, which in most cases were
quite consistent.
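
The dependence of the LD on shared vocabulary can be illustrated with a short sketch of the standard dynamic-programming edit distance. This is an illustrative example under stated assumptions, not the code used in this study: the function name `levenshtein`, the unit cost for every insertion, deletion and substitution, the absence of any length normalisation, and the example sequences are all hypothetical choices.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical sequences, one character per element type.
shared_a = "ABCABC"
shared_b = "ABCCBA"     # same vocabulary as shared_a, different order
other_vocab = "XYZXYZ"  # same repeating structure as shared_a, new elements

d_shared = levenshtein(shared_a, shared_b)    # 2
d_other = levenshtein(shared_a, other_vocab)  # 6
```

In this example the pair drawn from the same vocabulary is closer than the pair that shares only its repeating structure, because every element of the second pair must be substituted.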

Sample sizes can be constrained in the study of wild animals 
and particularly in marine mammal studies. Samples may be 
collected infrequently and with a patchy distribution due to the 
challenging conditions presented in collecting such data. 
Understanding how a metric reacts to a small sample size is 
invaluable in metric choice. The robust nature of the LD and
NG to smaller sample sizes and their high performance in the
comparison task make them appealing for analysis. The data
presented here indicated that LD and NG performed well with
a sample size of three or fewer, while TT and RD should not be
considered as metrics for analysis until a sample size of four or
more is available.

Here, we have presented a robust understanding of
which metric should be preferentially employed in studies
involving the comparison of individual- or group-specific
vocalisations, such as signature whistles. The success in
identifying individual/geographical variations in vocal
sequences has implications for assessing population structure,
song transmission and dialect similarity, particularly
for populations where rapid song changes occur. For
example, the analysis of humpback whale song presented
here was able to identify population origin despite rapid
song dynamics (Garland et al. 2011, 2012, 2013). We suggest
that the LD can be applied to any level within a complex
display, but suggest future studies strive for the lowest
level sequence within the hierarchy (i.e. sequence of units
or phrases), to increase the amount of information directly
compared and thus encapsulated by the sequence.

The LD method provides a metric to compare sequence content
and organisation (and thus songs) within and among multiple
individuals, populations, years and locations. In
particular, transmission of humpback whale song is largely cul- 
tural, and the level and rate of change remains unparalleled in 
any other non-human animal as complete population-wide 
changes are replicated in multiple populations at a vast geo- 
graphical scale (Garland et al. 2011). Thus, fundamental ques- 
tions in animal culture, vocal learning and cultural evolution 
can be explored using humpback whale song as a model, and 
with the help of the LD metric. Further, the evolution of com- 
plex vocal labels (i.e. signature whistles) and the underlying 
cognitive abilities required for such evolution are extremely 
important in understanding the evolution of vocal complexity 
(Janik 2014). Robust metrics that capture the information 
encoded in the sequences with the highest fidelity are thus 
required to address these far-reaching evolutionary questions. 
We suggest the LD should be utilised in such comparison stud- 
ies in preference to Markov and information theory-based 
models. 


Conclusions 


The LD (binary metric) significantly outperformed all other 
tested metrics in our comparative analysis of animal acoustic 
sequences. It provides a direct measure of pairwise differences 
among sequences, instead of a comparison of aggregate similarity.
NGs (Markov chains) were the second most successful
metric; the underlying issue that the tested species’ vocalisations
may be governed by non-Markovian dynamics and the
consistent success of the LD metric suggest that NGs should
always be a second choice. Given the inherent interest in the
origins of human language and the evolution of signalling 
complexity, robust and reliable metrics that can capture the 
content and arrangement of the signal are essential to address 
these fundamental questions in animal communication and 
cultural evolution. 

Fig. S1. Zipf plots of the vocabulary frequencies for each of the empirical
data sets.